A string literal or anonymous string is a literal for a string value in source code. Commonly, a programming language includes a string literal code construct that is a series of characters enclosed in bracket delimiters, usually quote marks. In many languages, the text "foo" is a string literal that encodes the text foo, but there are many other variations.
A quotation mark is the most common way to delimit a string literal. Many languages support double quotes (") and/or single quotes ('). When both are supported, delimiter collision can be minimized by treating one style of quotes as normal text when enclosed in quotes of the other style. In Python, a literal whose outer quotes are double may contain single quotes, since the inner single quotes are then regular text.
An empty string is written as "" or ''.
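The quoting rule described above can be sketched in Python. The variable names and sample text here are illustrative assumptions, not taken from the article.

```python
# Single quotes are ordinary text inside a double-quoted literal, and vice versa.
double_outer = "It's a string"
single_outer = 'She said "hello"'

# Both quote styles spell the same empty string.
empty_double = ""
empty_single = ''

print(double_outer)
print(single_outer)
print(empty_double == empty_single)
```

Choosing the outer quote style that does not occur in the text avoids any need for escaping.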
Paired delimiters are two different characters, one used at the beginning of a literal and the other at the end. With paired delimiters, the language can support embedding quotes in the literal text as long as they are all paired and do not partially jump out of their own scope. For example, PostScript uses parentheses, as in (The quick (brown fox)), and m4 uses a backtick at the start and an apostrophe at the end. Tcl allows both quotes and braces, as in "The quick brown fox" or {The quick {brown fox}}; this derives from the single quotations in Unix shells and the use of braces in C for compound statements, since blocks of code are syntactically the same thing as string literals in Tcl. That the delimiters are paired is essential for making this feasible.
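How a reader of paired delimiters can find the matching close is sketched below in Python. The function name read_paired and its parameters are hypothetical, and the sketch deliberately ignores escape characters, which real languages such as PostScript and Tcl also support.

```python
def read_paired(text, start=0, open_ch="(", close_ch=")"):
    """Return the content of a paired-delimiter literal starting at text[start].

    Nested pairs are allowed; a depth counter tracks them.
    Raises ValueError if the delimiters are unbalanced.
    """
    if text[start] != open_ch:
        raise ValueError("literal must begin with the opening delimiter")
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth == 0:
                # The outermost pair has closed; return what was inside it.
                return text[start + 1:i]
    raise ValueError("unbalanced delimiters")

print(read_paired("(The quick (brown fox))"))
print(read_paired("{The quick {brown fox}}", open_ch="{", close_ch="}"))
```

A simple counter is enough here because only one kind of pair is tracked at a time; mixing several kinds of pairs would call for a stack.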
Quotation is most commonly via unpaired quotes, but some tools and character sets support paired quotes. Unpaired quotes are quotes that don't have unique left-side-quote and right-side-quote variants, including "" or ''. The Unicode character set includes paired versions:
“Hi there!”
‘Hi there!’
„Hi there!“
«Hi there!»
In other words, "scope mixing" is where an unambiguous beginning delimiter (or an unambiguous ending delimiter) doesn't exist.
Examples of a quote jumping out of its own scope (which are illegal strings) would be:
made for each specific programming language to prevent ambiguity, such as the tactic of reading expressions from the leftmost character to the rightmost character, giving a left ^ symbol a higher precedence (i.e., priority) over any other quotation mark that comes after it (on the same line).
As mentioned earlier, nested quotes can be valid, but they require some method that employs a hierarchy (typically a stack data structure), regardless of whether that hierarchy uses unique delimiting characters, escape characters, or repeated-character delimiters.
"Scope-mixing" is a specific case of delimiter collision, explained later in this page.
- title: An example multi-line string in YAML
  body : |
    This is a multi-line string.
    "special" metacharacters may
    appear here. The extent of this string is
    represented by indentation.
%map = (red => 0x00f, blue => 0x0f0, green => 0xf00);
Perl treats a non-reserved sequence of alphanumeric characters as a string literal in most contexts. For example, the following two lines of Perl are equivalent:
35HAn example Hollerith string literal
A drawback of this technique is that it is relatively error-prone unless length insertion is automated, especially for multi-byte encodings. Advantages include that it removes the need to search for the end delimiter and therefore requires less computational overhead, prevents delimiter collision issues, and enables the inclusion of metacharacters that might otherwise be mistaken for commands.
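A minimal Python sketch of length-prefixed (Hollerith-style) literals follows. The function names are hypothetical, and the sketch counts characters rather than bytes, which is exactly the point where multi-byte encodings make manual counting error-prone.

```python
def hollerith_encode(text):
    """Prefix the text with its character count and the letter H."""
    return f"{len(text)}H{text}"

def hollerith_decode(source, pos=0):
    """Read one length-prefixed literal from source starting at pos.

    Returns (text, next_pos). No search for a closing delimiter is needed.
    """
    h = source.index("H", pos)      # end of the decimal length field
    length = int(source[pos:h])
    start = h + 1
    end = start + length
    if end > len(source):
        raise ValueError("declared length runs past the end of the input")
    return source[start:end], end

encoded = hollerith_encode("An example Hollerith string literal")
print(encoded)
print(hollerith_decode(encoded))
```

Because the decoder jumps straight to the end position, quote characters inside the text need no special treatment.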
Paired quotes, such as braces in Tcl, allow nested strings, such as {foo {bar} zork}, but do not otherwise solve the problem of delimiter collision, since an unbalanced closing delimiter cannot simply be included, as in {}}.
'This Pascal string''contains two apostrophes'''
"I said, ""Can you hear me?"""
"This is John's apple."
'I said, "Can you hear me?"'
This does not allow having a single literal with both delimiters in it, however. This can be worked around by using several literals and using string concatenation:
'I said, "This is ' + "John's" + ' apple."'
'I said, "This is '"John's"' apple."'
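The concatenation workaround can be tried directly in Python, which accepts both the explicit + form and adjacent literals. This is a small sketch; the variable names are assumptions.

```python
# Concatenating single-quoted and double-quoted literals with +
explicit = 'I said, "This is ' + "John's" + ' apple."'

# Python also joins adjacent literals, so the + can be omitted
adjacent = 'I said, "This is ' "John's" ' apple."'

print(explicit)
print(explicit == adjacent)
```

Each piece uses the quote style that does not occur in its own text, so no escaping is needed anywhere.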
D supports a few quoting delimiters, with such strings starting with q" plus an opening delimiter and ending with the respective closing delimiter and ". Available delimiter pairs are (), <>, {}, and []; an unpaired non-identifier delimiter is its own closing delimiter. The paired delimiters nest, so that q"(foo(xxx))" is a valid literal; an example with the non-nesting / character is q"/foo]/".
Similar to C++11, D allows here-document-style literals with end-of-string ids: q"end-of-string-id newline content newline end-of-string-id"
In some programming languages, such as Bourne shell and Perl, there are different delimiters that are treated differently, such as doing string interpolation or not, and thus care must be taken when choosing which delimiter to use; see the section on different kinds of strings below.
For example, in Perl:
all produce the desired result. Although this notation is more flexible, few languages support it; other than Perl, Ruby (influenced by Perl) and C++11 also support these. A variant of multiple quoting is the use of here document-style strings.
Lua (as of 5.1) provides a limited form of multiple quoting, particularly to allow nesting of long comments or embedded strings. Normally one uses [[ and ]] to delimit literal strings (initial newline stripped, otherwise raw), but the opening brackets can include any number of equal signs, and only closing brackets with the same number of signs close the string. For example:
Multiple quoting is particularly useful with regular expressions that contain usual delimiters such as quotes, as this avoids needing to escape them. An early example is sed, where in the substitution command s/regex/replacement/ the default slash / delimiters can be replaced by another character, as in s,regex,replacement, .
For example, early forms of BASIC did not include escape sequences or any other workarounds listed here, and thus one instead was required to use the CHR$ function, which returns a string containing the character corresponding to its argument. In ASCII the quotation mark has the value 34, so to represent a string with quotes on an ASCII system one would write
These constructor functions can also be used to represent nonprinting characters, though escape sequences are generally used instead. A similar technique can be used in C++ with the std::string stringification operator.
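The character-code technique has a direct analogue in Python, where chr() returns the character for a given code. This is only a sketch; the name QUOTE is an assumption, and the sample sentence is borrowed from the quoting examples earlier in the article.

```python
QUOTE = chr(34)  # 34 is the ASCII code of the double quote character

# Build the text without writing a double quote inside any literal
message = 'I said, ' + QUOTE + 'Can you hear me?' + QUOTE
print(message)
```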
One character is chosen as a prefix to give encodings for characters that are difficult or impossible to include directly. Most commonly this is backslash; in addition to other characters, a key point is that backslash itself can be encoded as a double backslash \\ and for delimited strings the delimiter itself can be encoded by escaping, say by \" for ". A regular expression for such escaped strings can be given as follows, as found in the ANSI C specification:
"(\\.|[^\\"])*"
meaning "a quote; followed by zero or more of either an escaped character (backslash followed by something, possibly backslash or quote), or a non-escape, non-quote character; ending in a quote". The only issue is distinguishing the terminating quote from a quote preceded by a backslash, which may itself be escaped. Multiple characters can follow the backslash, such as \uFFFF, depending on the escaping scheme.
An escaped string must then itself be lexically analyzed, converting the escaped string into the unescaped string that it represents. This is done during the evaluation phase of the overall lexing of the computer language: the evaluator of the lexer of the overall language executes its own lexer for escaped string literals.
Among other things, it must be possible to encode the character that normally terminates the string constant, and there must be some way to specify the escape character itself. Escape sequences are not always pretty or easy to use, so many compilers also offer other means of solving the common problems. Escape sequences, however, solve every delimiter problem, and most compilers interpret them. When an escape character is inside a string literal, it means "this is the start of the escape sequence". Every escape sequence specifies one character which is to be placed directly into the string. The actual number of characters required in an escape sequence varies. The escape character is on the top/left of the keyboard, but the editor will translate it, so it cannot be typed directly into a string. The backslash is used to represent the escape character in a string literal.
Many languages support the use of metacharacters inside string literals. Metacharacters have varying interpretations depending on the context and language, but are generally a kind of 'processing command' for representing printing or nonprinting characters.
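The two stages, recognizing an escaped literal and then converting it to the string it represents, can be sketched in Python. The regular expression is the pattern for backslash-escaped, double-quoted strings; the escape table and the function names are hypothetical and cover only a few simple escapes, not the full set of any particular language.

```python
import re

# Pattern for a double-quoted, backslash-escaped literal, written as a raw
# string so that the backslashes reach the regex engine unchanged.
ESCAPED_STRING = re.compile(r'"(\\.|[^\\"])*"')

# Hypothetical escape table covering only a few simple escapes.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def unescape(body):
    """Convert the text between the quotes into the string it represents."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of literal")
        code = body[i + 1]
        if code not in SIMPLE_ESCAPES:
            raise ValueError("unsupported escape: \\" + code)
        out.append(SIMPLE_ESCAPES[code])
        i += 2
    return "".join(out)

def string_values(source):
    """Find each escaped literal in source and return its unescaped value."""
    return [unescape(m.group(0)[1:-1]) for m in ESCAPED_STRING.finditer(source)]

source = r'say("This is \"in quotes\" and properly escaped.") + "tab\there"'
for value in string_values(source):
    print(value)
```

Keeping the recognizer and the unescaper separate mirrors the description above: the outer scan only has to find where a literal ends, while the inner pass gives each escape its meaning.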
For instance, in a C string literal, if the backslash is followed by a letter such as "b", "n" or "t", then this represents a nonprinting backspace, newline or tab character respectively. Or if the backslash is followed by 1-3 octal digits, then this sequence is interpreted as representing the arbitrary code unit with the specified value in the literal's encoding (for example, the corresponding ASCII code for an ASCII literal). This was later extended to allow more modern hexadecimal character code notation:
| Escape sequence | Character placed into the string |
| \0 | null character (typically as a special case of \ooo octal notation) |
| \a | alert |
| \b | backspace |
| \f | form feed |
| \n | line feed (or newline in POSIX) |
| \r | carriage return (or newline in Mac OS 9 and earlier) |
| \t | horizontal tab |
| \v | vertical tab |
| \e | escape character (GCC, clang and tcc) |
| \u#### | 16-bit Unicode character where #### are four hex digits |
| \U######## | 32-bit Unicode character where ######## are eight hex digits (Unicode character space is currently only 21 bits wide, so the first two hex digits will always be zero) |
| \u{######} | 21-bit Unicode character where ###### is a variable number of hex digits |
| \x## | 8-bit character specification where # is a hex digit. The length of a hex escape sequence is not limited to two digits, instead being of an arbitrary length. |
| \ooo | 8-bit character specification where o is an octal digit |
| \" | double quote (") |
| \& | non-character used to delimit numeric escapes in Haskell |
| \' | single quote (') |
| \\ | backslash (\) |
| \? | question mark (?) |
Note: Not all sequences in the list are supported by all parsers, and there may be other escape sequences which are not in the list.
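As an illustration, several of these escape notations are accepted by Python string literals. The sketch below shows four different spellings of the letter A plus a few control characters; the variable names are assumptions, and other languages support a different subset.

```python
hex_a = "\x41"          # hexadecimal escape
octal_a = "\101"        # octal escape
unicode_a = "\u0041"    # 16-bit Unicode escape
wide_a = "\U00000041"   # 32-bit Unicode escape
controls = "\t\n\\"     # tab, newline, and one backslash

print(hex_a, octal_a, unicode_a, wide_a)
print(len(controls))
```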
Incorrect quoting of nested strings can present a security vulnerability. Use of untrusted data, as in data fields of an SQL query, should use prepared statements to prevent a code injection attack. In PHP 2 through 5.3, there was a feature called magic quotes which automatically escaped strings (for convenience and security), but due to problems was removed from version 5.4 onward.
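The difference between a prepared (parameterized) statement and a query assembled from string literals can be sketched with Python's standard sqlite3 module. The table, column, and input values are hypothetical, chosen only to make the contrast visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

# Untrusted input that breaks out of a quoted SQL literal if pasted in directly
untrusted = "x' OR '1'='1"

# The ? placeholder sends the value separately from the SQL text,
# so its quote characters are never parsed as SQL.
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (untrusted,)
).fetchall()
print(rows)

# For contrast only: building the query by concatenation changes its meaning.
unsafe_sql = "SELECT name FROM users WHERE name = '" + untrusted + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()
print(unsafe_rows)
conn.close()
```

With the placeholder, the input is compared as a plain value and matches nothing; with concatenation, the embedded quotes turn it into an extra OR condition that matches every row.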
Raw strings are particularly useful when a common character needs to be escaped, notably in regular expressions (nested as string literals), where backslash \ is widely used, and in DOS/Windows paths, where backslash is used as a path separator. The profusion of backslashes is known as leaning toothpick syndrome, and can be reduced by using raw strings. Compare escaped and raw pathnames in C#:
"The Windows path is C:\\Foo\\Bar\\Baz\\"
@"The Windows path is C:\Foo\Bar\Baz\"
In XML documents, CDATA sections allow use of characters such as & and < without an XML parser attempting to interpret them as part of the structure of the document itself. This can be useful when including literal text and scripting code, to keep the document well formed.
foo
bar
Languages that allow literal newlines include Bash, Lua, Perl, PHP, R, and Tcl. In some other languages string literals cannot include newlines.
Two issues with multiline string literals are leading and trailing newlines, and indentation. If the initial or final delimiters are on separate lines, there are extra newlines, while if they are not, the delimiter makes the string harder to read, particularly for the first line, which is often indented differently from the rest. Further, the literal must be unindented, as leading whitespace is preserved – this breaks the flow of the code if the literal occurs within indented code.
The most common solution for these problems is here document-style string literals. Formally speaking, a here document is not a string literal, but instead a stream literal or file literal. These originate in shell scripts and allow a literal to be fed as input to an external command. The opening delimiter is <<END where END can be any word, and the closing delimiter is END on a line by itself, serving as a content boundary; the << is due to redirecting stdin from the literal. Due to the delimiter being arbitrary, these also avoid the problem of delimiter collision. These also allow initial tabs to be stripped via the variant syntax <<-END though leading spaces are not stripped. The same syntax has since been adopted for multiline string literals in a number of languages, most notably Perl; these are also referred to as here documents and retain the syntax, despite being strings and not involving redirection. As with other string literals, these can sometimes have different behavior specified, such as variable interpolation.
Python, whose usual string literals do not allow literal newlines, instead has a special form of string, designed for multiline literals, called triple quoting. These use a tripled delimiter, either ''' or """. These literals are especially used for inline documentation, known as docstrings.
Tcl allows literal newlines in strings and has no special syntax to assist with multiline strings, though delimiters can be placed on lines by themselves and leading and trailing newlines stripped via string trim, while string map can be used to strip indentation.
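A short Python sketch of triple quoting follows, which also shows one common way to handle the newline and indentation issues described earlier. The function and its usage text are hypothetical; textwrap.dedent is part of the standard library.

```python
import textwrap

def usage():
    """Return a short usage message (this string is itself a docstring)."""
    text = """
        usage: tool [options]
          -h  show help
    """
    # dedent removes the common leading indentation; strip drops the
    # newlines next to the opening and closing delimiters.
    return textwrap.dedent(text).strip("\n")

print(usage())
print(usage.__doc__)
```

Putting the delimiters on their own lines keeps the literal readable inside indented code, at the cost of a post-processing step.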
In practical terms, this allows string concatenation in early phases of compilation ("translation", specifically as part of lexical analysis), without requiring phrase analysis or constant folding. For example, the following is valid C:
This is particularly important when used in combination with the C preprocessor, to allow strings to be computed following preprocessing, particularly in macros. As a simple example:
A more complex example uses stringification of integers (by the preprocessor) to define a macro that expands to a sequence of string literals, which are then concatenated to a single string literal with the file name and line number:
Beyond syntactic requirements of C/C++, implicit concatenation is a form of syntactic sugar, making it simpler to split string literals across several lines, avoiding the need for line continuation (via backslashes) and allowing one to add comments to parts of strings. The Python Language Reference (2. Lexical analysis, 2.4.2. String literal concatenation) notes: "This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings." For example, in Python, one can comment a regular expression in this way:
import re
from re import Pattern

pattern: Pattern = re.compile(
    "[A-Za-z_]"       # letter or underscore
    "[A-Za-z0-9_]*"   # letter, digit or underscore
)
A subtler issue is that in C and C++ there are different types of string literals, and concatenation of these has implementation-defined behavior, which poses a potential security risk (C++11 draft standard, 2.14.5 String literals lex.string, note 13, p. 28–29: "Any other concatenations are conditionally supported with implementation-defined behavior.").
One of the oldest examples is in shell scripts, where single quotes indicate a raw string or "literal string", while double quotes have escape sequences and variable interpolation.
For example, in Python, raw strings are preceded by an r or R: compare 'C:\\Windows' with r'C:\Windows' (though a Python raw string cannot end in an odd number of backslashes). Python 2 also distinguishes two types of strings: 8-bit ASCII ("bytes") strings (the default), explicitly indicated with a b or B prefix, and Unicode strings, indicated with a u or U prefix, while in Python 3 strings are Unicode by default and bytes are a separate bytes type that when initialized with quotes must be prefixed with a b.
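The prefixes described above can be checked with a few lines of Python 3. This is a sketch and the variable names are assumptions.

```python
escaped = 'C:\\Windows'
raw = r'C:\Windows'
print(escaped == raw)   # both literals spell the same path

# In Python 3 an unprefixed literal is a Unicode str and b'...' is bytes
text = 'abc'
data = b'abc'
print(type(text).__name__, type(data).__name__)
print(text.encode('ascii') == data)
```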
C#'s notation for raw strings is called @-quoting.
C++11 allows raw strings, Unicode strings (UTF-8, UTF-16, and UTF-32), and wide character strings, determined by prefixes. It also adds literals for the existing C++ std::string, which is generally preferred to the existing C-style strings.
In Tcl, brace-delimited strings are literal, while quote-delimited strings have escaping and interpolation.
Perl has a wide variety of strings, which are more formally considered operators, and are known as quote and quote-like operators. These include both a usual syntax (fixed delimiters) and a generic syntax, which allows a choice of delimiters. These include:
REXX uses suffix characters to specify characters or strings using their hexadecimal or binary code. E.g.,
For example, the following Perl code:
produces the output:
In this case, the $ metacharacter (not to be confused with the sigil in the variable assignment statement) is interpreted to indicate variable interpolation, and requires some escaping if it needs to be output literally.
This should be contrasted with the printf function, which produces the same output using notation such as:
but does not perform interpolation: the %s is a placeholder in a printf format string, but the variables themselves are outside the string.
This is contrasted with "raw" strings:
which produce output like:
Here the $ characters are not metacharacters, and are not interpreted to have any meaning other than plain text.
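The three behaviors (interpolation, printf-style formatting, and plain text) can be sketched in Python, which uses an f prefix and braces rather than the $ sigil. The variable names and sample sentence are illustrative assumptions.

```python
name = "Nancy"
greeting = "Hello World"

# Interpolation: the f prefix makes {name} and {greeting} placeholders
interpolated = f"{name} said {greeting} to the crowd of people."

# printf-style formatting: %s marks the slots, the values stay outside the string
formatted = "%s said %s to the crowd of people." % (name, greeting)

# No f prefix: the braces are plain text and nothing is substituted
plain = "{name} said {greeting} to the crowd of people."

print(interpolated)
print(formatted)
print(plain)
```

The first two lines of output are identical; the third prints the placeholder names unchanged, which is the behavior of a literal without interpolation.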
For example:
Nevertheless, some languages are particularly well-adapted to produce this sort of self-similar output, especially those that support multiple options for avoiding delimiter collision.
Using string literals as code that generates other code may have adverse security implications, especially if the output is based at least partially on untrusted user input. This is particularly acute in the case of Web-based applications, where malicious users can take advantage of such weaknesses to subvert the operation of the application, for example by mounting an SQL injection attack.
"[A-Za-z_]" # letter or underscore
"[A-Za-z0-9_]*" # letter, digit or underscore
)
Problems
Different kinds of strings
String interpolation
Nancy said Hello World to the crowd of people.
$name said $greeting to the crowd of people.
Embedding source code in string literals